An improved algorithm for inferring mutational parameters from bar-seq evolution experiments

Background Genetic barcoding provides a high-throughput way to simultaneously track the frequencies of large numbers of competing and evolving microbial lineages. However making inferences about the nature of the evolution that is taking place remains a difficult task. Results Here we describe an algorithm for the inference of fitness effects and establishment times of beneficial mutations from barcode sequencing data, which builds upon a Bayesian inference method by enforcing self-consistency between the population mean fitness and the individual effects of mutations within lineages. By testing our inference method on a simulation of 40,000 barcoded lineages evolving in serial batch culture, we find that this new method outperforms its predecessor, identifying more adaptive mutations and more accurately inferring their mutational parameters. Conclusion Our new algorithm is particularly suited to inference of mutational parameters when read depth is low. We have made Python code for our serial dilution evolution simulations, as well as both the old and new inference methods, available on GitHub (https://github.com/FangfeiLi05/FitMut2), in the hope that it can find broader use by the microbial evolution community. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-023-09345-x.

indicator variable for the establishment of a mutation occurring in lineage i Table S1: Definitions of commonly used quantities.

S2. Definition of fitness
Evolutionary theory commonly uses two different definitions of fitness: Malthusian fitness and Wrightian fitness. Let t denote the elapsed time (measured in generations), and n(t) denote the number of cells at time t. Malthusian fitness s m is defined as the exponential growth rate (per generation) of the population, and Wrightian fitness s w is defined as the number of offspring of a cell per generation. So for deterministic growth without death, these definitions imply n(t) = n(0)e smt ; (S1) The fitness of a mutation in our simulations is defined in the Wrightian sense, which is natural for discrete generations. The branching process in the theory for our inference algorithm uses fitness in the Malthusian sense, which arises naturally in continuous time. Malthusian and Wrightian fitness are related by s m = ln(1 + s w ) and are approximately equal when the fitness effect per generation is small, as is the case in our analyses. For fair comparison, we convert the mean fitness and fitness effect in the simulation from Wrightian fitness to Malthusian fitness when comparing the ground truth of the simulation to the inferred value.
In a well-mixed environment without ecological interactions, all cells in the population compete against the mean fitness. For an adaptive lineage i that originated from a single mutant cell, when the fitness of lineage, s i , is larger than the mean fitness, the lineage increases in frequency. Once the mean fitness passes the fitness of the lineage, the lineage begins to decrease in frequency. Over time, the mean fitness increases as fitter adaptive lineages become enriched. If the total population size is fixed and we track lineage frequencies {f i;k } at each timepoint subject to P i f i;k = 1, one can see that they obey the equation Table S1 ands(t) = P i s i f i (t), which varies continuously in time. Note that this definition ofs relies on the product of the individual fitnesses {s i } and the time between samples being small. When this is not true, it is still possible to jointly infer the individual lineage fitnesses without making use of the mean fitness (see Ref. [3]), but we have not pursued this here, in favor of explicitly accounting for the fraction of each lineage that carries a beneficial mutation (Section S5). This choice is likely to decrease the validity of our inference algorithm when mutations of large fitness effect arise. Equation S3 is used to estimate the average read number of a lineage conditioned on its read number at the previous timepoint -however since we only infer the mean fitness at timepoints that are sequenced, we interpolates linearly between these points of reference in order to calculate E. Under this approximation,

S3. Effective process
As discussed in the supplementary information (SI) of Ref. [1] and in Section S10, the process of population growth for g generations followed by 1 : 2 g dilution can be described as a time-homogeneous branching process if one chooses the correct effective values of n(t) (effective lineage size) and 2c (offspring number variance per-individual). Specifically, for lineage i, the process of a growth cycle (growth of g generations followed by 1 : 2 g dilution) from t k−1 to t k can be described as a single step of a branching process with effective lineage size n i;k given by g times the number of cells transferred at the dilution, and 2c given by the per-individual variance per growth cycle from the batch culture. Equivalently, one could define the effective size of each lineage as its actual size at bottleneck, and 2c reduced from the variance per cycle by a factor of g . In this work, we use the former convention.
In our simulated model with stochastic cell doubling and dilution noise, the per-individual variance in offspring number per growth cycle can be calculated using the law of total variance, and is 2c ≈ 2, with g generations of stochastic cell doubling introducing a variance of ≈ 1 (for g ≫ 1), and 1 : 2 g dilution introducing another variance of 1. Therefore, the value of c which describes the simulated process in terms of the branching process model is c = 1. However, in real data the value of c is unknown, and is in fact of biological interest. Inferring it from our data is a future goal (as mentioned in the Discussion of the main text). In the current framework, when we infer our fitnesses, setting c = 1 only affects the establishment size of a beneficial mutation during the inference process, and we therefore expect that it does not much change our results.

S4. Parameterizing noise
Barcoded evolution experiments involve two processes: 1) a series of batch cultures, during which in each cycle the population grows for about g generations and then is diluted by a factor of 2 g into the next batch, and 2) sampling and preparation of cells via genomic DNA extraction, PCR and sequencing. As discussed in the SI of [1], both processes introduce stochasticity in the number of reads per barcode at each timepoint. Dealing with this stochasticity is essential for detecting adaptive mutations and for estimating their mutational parameters.
As we have discussed, the first step (cell growth and transfer) can be modeled as the branching process (Section S3). The second step (DNA extraction, PCR and sequencing) is also effectively a branching process, now composed with the results of the first step. The stochasticity coming from these processes is essentially a consequence of the compounding of Poisson counting statistics over time, often referred to as birth-death stochasticity, and means that the variance of r i;k , Var(r i;k ), is proportional to its expected value, ⟨r i;k ⟩. The constant of proportionality is given by a phenomenological parameter 2» k , which captures all sources of noise (cell growth and dilution, DNA extraction, PCR and sequencing). Under this noise model, Var(r i;k ) = 2» k ⟨r i;k ⟩ for every lineage i. In order to determine the values of the » k for each timepoint t k , we first find all lineages with read number at t k−1 being r i;k−1 = q. Then we estimate the mean -and the variance ff 2 of r i;k conditioned on r i;k−1 = q. Then, conditional on q, » k = ff 2 =(2-). We average » k over all q ∈ [20; 30], which are likely to correspond to neutral lineages for typical read depths (see also the SI of Ref. [1]). Although we only determine the values of » k once and for all timepoints at the beginning of our inference algorithm, we have found that our results are quite insensitive to the values of the {» k } inferred.

S5. Algorithm details
Here we describe in detail the algorithm executed by FitMut2, which is similar to the algorithm implemented in Ref. [3], the main difference being the inference of establishment times from the lineage trajectories rather than from phylogenetic relationships observable from repeated barcoding. The basic premise is Bayes' theorem: we wish to know the probability that each lineage is adaptive, conditional on our observed lineage trajectory {r i;k }. In this and the subsequent section, we assume that the lineage index is always i, and thus suppress the i subscript. Bayes' theorem tells us that for a lineage with observed lineage trajectory {r k }, where Θ is an indicator variable that is 1 if the lineage contains an established mutation and 0 otherwise.
When " = 1, we can further decompose this adaptive hypothesis into an integral over all possible ways to be adaptive, with a prior distribution p(s; fi ) over s and fi . Thus, When " = 0 we use the prior P (Θ = 0) ≈ 1, since the vast majority of lineages do not obtain a beneficial mutation (or equivalently R p(s; fi )dsdfi ≪ 1). Therefore, we reduce the problem of inference to calculating: 1) the probability of the {r k } given the neutral hypotheses and 2) the probability of the {r k } given the adaptive hypothesis with a mutation having fitness effect s and establishment time fi . Together with a prior distribution over s and fi , this gives us the tools to assess the probability that a lineage is adaptive.
Note that we have split the hypothesis into a neutral and adaptive case, instead of inferring an s and fi for each lineage, for two reasons: The first is that this split allows us to save computation time, by only finding the most likely s and fi for those lineages deemed likely enough to be adaptive (which is a small fraction of all lineages). The second is that if we were to have a single prior distribution for s, which had support at 0, then some lineages might be assigned very small fitness effects, which may not reflect real beneficial mutations.
We now explain how to calculate the various terms needed to evaluate the likelihood of a lineage trajectory under the two different scenarios.

S5.1. Neutral hypothesis
To calculate p({r k }|Θ = 0), we make an important approximation. First we note that In other words, the effective cell number at each timepoint obeys a Markov process, and the distribution of n k only depends on the value of n k−1 . Note that strictly speaking, the analogous statement for the read numbers {r k } is false: The distribution of r k does not only depend on r k−1 , due to independent sequencing noise introduced at each timepoint. Strictly, for the read numbers {r k }, Nonetheless, we make the approximation that the joint distribution of the {r k } factorizes in the same way as that of the {n k }. Though we expect that this assumption become less accurate at low read depth, where the independent noise at each timepoint is larger, empirically our algorithm still seems to be accurate over the range of parameters we have tested. We additionally investigated an alternative method that does not make this assumption, and instead tries to jointly infer the sequence of cell numbers {n k } that is most likely to have produced our data. However, we have not pursued this method in the current work, due to its infeasibility under the adaptive hypothesis.
With the assumption of Markovian dynamics for the {r k }, we define the likelihood of our data given the neutral hypothesis as Here I 1 is the modified Bessel function of the first kind, and the notation ⟨r k |r k−1 ⟩ indicates the expected value of r k conditional on the value of r k−1 and Θ = 0. This value is given by the expression for the decline of a neutral lineage in the presence of an increasing mean fitness, We can then take the product of these conditional distributions as specified in Equation S9, in order to get the likelihood of our data under the neutral hypothesis. However we must deal separately with the first timepoint which gives us a factor of p(r 0 ). This is discussed below.

S5.2. Adaptive hypothesis
For the adaptive hypothesis, we have When an adaptive mutation occurs and establishes in a lineage, the lineage can contain both mutant and neutral cells which are indistinguishable on the basis of barcode alone. Before the establishment, the neutral cells comprise almost the entire lineage, while after an adaptive mutation establishes, the mutant cells begin to sweep through the lineage.
In order to calculate the likelihood of our data under the adaptive hypothesis, we use Equation S10, but condition on a value of s and fi rather than Θ = 0, and with the conditional expectations ⟨r k |r k−1 ⟩ calculated under the assumption of a given s and fi (Equation S13). Assuming that an adaptive mutation with fitness effect s establishes at time fi in a lineage, the average read number of a lineage at timepoint t k is Here, 1 {t>fi } is an indicator variable that is 1 if t > fi and 0 otherwise, and we have used the fact that the establishment size of a lineage beyond which it grows deterministically (as if it had started at this size at its establishment time) is given by c s−s(fi ) .
We assume that n mut Furthermore, we impose a lower cutoff on s −s(fi ) to avoid very large establishment sizes.
Since we also want to calculate likelihoods for fi < 0 (corresponding to mutations that occurred during the pre-growth phase), we need the mean fitness before t = 0 when calculating E t k fi : We assume that the mean fitness is 0 before the experimental evolution begins.
As in the case of the neutral hypothesis, we also need to deal with the first sequencing timepoint t 0 , for which we do not have a previous timepoint on which to condition. Therefore for the first timepoint we define p(r 0 |Θ = 0) = p(r 0 |s; fi ) = p(r 0 ); (S14) Here, ⟨r 0 ⟩ = r 0 . In other words, the mean of r 0 is approximated by the measured value of r 0 for each lineage. » 0 is assumed to be 2:5, which is empirically a typical value. Estimating the most likely value of n 0 and the resulting distribution of r 0 from the data is an interesting problem which we leave for future work.
Therefore, under the adaptive hypothesis, the probability of observing our data is given by Therefore, to calculate the probability that a lineage is adaptive, we numerically integrate F (s; fi ) for that lineage over s and fi . Since this calculation for each lineage is independent, it can be readily parallelized (see Discussion in the main text).
If P (Θ = 1|{r k }) > 0:5, we conclude that the lineage contains an established mutation -though this threshold can be changed depending on the use case. The fitness effect and establishment time that we infer for this lineage are given by the s and fi that maximize the logarithm of posterior likelihood, ln F (s; fi ). This maximization is discussed in the next section.

S5.3. Numerical optimization
The fitness effects and establishment times of adaptive mutations are inferred by optimizing the posterior likelihood (Section S5) over the s and fi . There are several choices for how to do this optimization. The most straightforward is to perform a direct search: evaluating ln F (s; fi ) on a grid of points and finding the s and fi which yield the maximum value. This is the default mode of operation of our algorithm, and works due to the low dimensionality of our search space. However we have also provided alternate gradient-free optimization routines, such as the Nelder Mead algorithm and the differential evolution algorithm. The results presented in this work use the Nelder Mead optimization algorithm. Figure S5 shows the landscape of the log-posterior likelihood for an adaptive lineage in which an adaptive mutation established, as well as the landscape of the log-posterior likelihood for a neutral lineage. We see that F i (s; fi ) is not particularly sharply peaked in all directions -in fact its density is strongly concentrated along a one-dimensional curve in two dimensions. Though the Nelder mead algorithm accurately finds the maximum in the likelihood landscape, there is some mismatch between the location of this maximum and the true values of the fitness and occurrence times.

S5.4. Updating the mean fitness
Since the algorithm proceeds iteratively, we need a rule for updating the mean fitness after the fitnesses of individual lineages are inferred. Specifically, at the end of each iteration, we estimate the effective number of mutant cells for each lineage, n mut i;k , for each timepoint t k , given the s i and fi i that were inferred. Then we update the mean fitness ass We can now proceed by re-inferring the individual lineage fitnesses given our updated mean fitness.

S6. Error estimation
In addition to estimating s and fi for each adaptive lineage, we provide uncertainties of our estimates, which can be calculated via the Hessian matrix of ln F , evaluated at maximal point (s 0 ; fi 0 ) found by our optimization algorithm. The Hessian is given by The eigenvalues of H give information about how sharp of a maximum in ln F we have found -if the maximum is very shallow we are more uncertain of its location, and therefore of our estimates of s and fi . Following Ref. [1], we calculate the eigenvectorsṽ 1 ;ṽ 2 and eigenvalues -1 ; -2 of H. By projecting the eigenvectors along the s and fi directions of our parameter space and using the fact that F is locally Gaussian with scale 1= √ i in the directioñ v i , we define the error in s to be max(ṽ 1 ·s= √ -1 ;ṽ 2 ·s= √ -2 ), and the error in fi to be max(ṽ 1 ·fi = √ -1 ;ṽ 2 ·fi = √ -2 ), wheres = (1; 0) andfi = (0; 1) are the unit vectors respectively.

S7. Effect of prior
The choice of prior-(s) affects the landscape over which we optimize ln F (s; fi ). Ignoring logarithmic terms, which have a negligible effect on the location of the maximum of F , we can expand our objective function as in the neighborhood of the optimum, where x is the optimal s that would be found under a uniform prior. Therefore if we use a prior-(s) ∼ e −s=-, we find that the optimal s where F is extremized shifts by an amount − 1 2¸-. If the lineage does not contribute much to the mean fitness, this shift in the fitness of lineage does not affect the optimization landscape and the prior that we have chosen simply shiftsŝ by an amount that depends inversely on¸and -.
From Figure S5 we estimate that¸≈ 20=:05 2 = 8000. Therefore varying -from 0:1 (its value in this work) within an order of magnitude would have little effect onŝ.

S8. DFE estimation
In this work, in addition to inferring the fitness effects of individual mutations, we have also inferred the distribution of fitness effects -(s). Since the fitness effect of a mutation affects how likely it is to be seen, we need a way to get -(s) from the observed fitness effects.
where the first term on the right side accounts for mutations during the evolution, and the second term accounts for mutations that arose during pre-growth. N f = 10 12 is the maximum population size after barcoding, before the separation of the replicates.
Thus, the distribution of fitness effect (DFE), -(s), satisfies the equation To solve -(s) from equation S23, let Here, Equation S23 is an approximation that does not account for the mean fitness. Therefore, to infer -(s) from our results, we choose the timepoint t = 32, at which time the mean fitness is typically still small.

S9. Simulation details
We have written simulations to produce data on which we have tested our algorithm, as a means to compare its results against a ground truth. The code for these simulations is included on GitHub along with our inference algorithm. In our simulations, during 16 generations of pre-growth, the number of offspring of a single cell with fitness s is distributed as Pois(2(1 + s)). Here, Pois(-) represents Poisson distribution with parameter -. This pre-growth allows lineage size to fluctuate, and introduces variability in the lineage size going into the pooled batch culture. At the end of the pre-growth phase, each cell i has grown into a colony of size n * i . To initialize the evolution experiment, an average of 100 cells per barcode are sampled: n i (0) cells from each lineage are sampled with whereL is the number of non-extinct colonies.L is then defined as the number of non-extinct lineages with n i (0) > 0.
During first batch culture cycle, growth noise is simulated by updating the number of descendants of a single cell according to After g generations, the cells which get transferred to the next batch are sampled randomly from the final population and are Poisson distributed with mean n i (g )=2 g .
Of the four DFEs in our simulation, two are truncated exponential distributions, that is, - ( In order to determine whether a mutation has established or not in simulation, we compare its instantaneous effective cell number n(t) to 2=(s −s(t)), and if it crosses this size we record it as established -this heuristic means that its probability of extinction would be about e −2 ifs stopped increasing thereafter. When we infer the mutational identity of lineages from experimental data, the establishment size used in our inference algorithm is c=(s −s). However, for real data we do not know a priori what the value of c is in experimental data, and our algorithm assumes that c = 1.

S10. Distribution of offspring from a single individual
The time-homogeneous birth-death process is a simple model to understand the growth of a lineage that is subject to both stochastic drift and systematic selection. Let n(t) denote a random variable that represents the number of individuals at the time t (where time is now taken to be continuous and in units of generations). Assume that each individual has a birth rate k b and death rate k d , and individuals grow or die independently of one another. In a small interval of time ‹t ≪ 1=k b ; 1=k d , each individual divides with probability k b ‹t, dies with probability k d ‹t, and does nothing with probability 1 − (k b + k d )‹t. Let B j and D j respectively denote Bernoulli random variables indicating birth or death in the interval [t; t + ‹t] for the jth individual. As done in Ref. [1], we define a moment generating function of this birth-death process with ⟨X⟩ Y denoting an average of the random variable X over the random variable Y . Since our branching process obeys the equation , we can derive a partial differential equation for this moment generating function r a n exp where a = n(0)e st and b = c s (e st − 1). Here, I 1 (x) denotes the modified Bessel function of the first kind ( ). The mean of this distribution is a and its variance is 2ab. For n large compared to b 2 =a, our distribution takes the limiting form is the theoretical distribution that FitMut1 used for the number of reads at the current timepoint t k conditioned on the previous timepoint t k−1 , where a is the mean number of reads at the current timepoint, and b is replaced by the » k . In FitMut2, we use Equation S30, which is more accurate for small read numbers. Equations S11 and S13 are used to calculate the expected number of reads corresponding to a particular barcode at the current timepoint, conditional on the read number at the previous timepoint, under neutral and adaptive hypotheses.
Our branching process can also tell us what the establishment size of a lineage is, above which it is destined to fix. Note that P (n(t) = 0) is the extinction probability by time t. We can calculate this probability directly from the moment generating function according to where the last approximation obtains when t ≫ 1=s, which corresponds to eventual probability of fixation. Therefore, the lineage of a single mutant cell (carrying an adaptive mutation with fitness effect s ≪ c) goes extinct with probability ≈ 1 − s=c and establishes with probability s=c. Furthermore, once a lineage reaches size c=s, it is unlikely to go extinct. If we define an establishment size c=s then the establishment time fi is defined through the relation n(t) = c s e s(t−fi ) (with n(0) = 1 meaning that the mutation occurred at time 0), which yields a t-independent distribution for fi − t as t → ∞ [2,1]. fi − t is asymmetrically distributed around 0 with width scaling as 1=s, and fi − t much more likely to be large and positive than large and negative on this scale. However the most likely value of fi − t is 0, which allows us to compare the occurrence time from simulation to establishment time from inference.   , with a and 2ab being respectively the mean and variance of the simulated data. The theoretical read number distribution is fit to the observed distribution with parameters a and b. This indicates that our simulation of pre growth is well approximated by the critical branching process with no fitness differences between lineages.